LEAGUE OF LEGENDS DATA ANALYSIS¶

By: Ethan Baxter Cota

Site Link: https://ethancota.github.io/LeagueOfLegendsDataModeling/

Import Statements¶

In [ ]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from plotly.subplots import make_subplots
import plotly.graph_objects as go

from dsc80_utils import *

# 'Prettify' console output
from tabulate import tabulate


# Statistical tests (sklearn imports live in the modeling cell below)
from scipy.stats import kstest, ttest_ind, ks_2samp

1: Introduction¶

Background¶

This dataset contains a lot of interesting data, so it is important to first understand what we have. The data describe three different groupings: the match, the teams, and the individual players. For each match there are both categorical and quantitative variables describing what happened.
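As a minimal sketch of those three groupings, the toy frame below separates team rows from player rows and counts rows per match. It assumes that team summary rows are marked with the value 'team' in the position column; that label, and the toy values, are assumptions for illustration and should be checked against the real data.

```python
import pandas as pd

# Toy stand-in for the real data: one match, two players per side plus two team rows.
# The 'team' label in `position` is an assumption about how summary rows are marked.
toy = pd.DataFrame({
    'gameid':   ['g1'] * 6,
    'side':     ['Blue', 'Blue', 'Red', 'Red', 'Blue', 'Red'],
    'position': ['top', 'mid', 'top', 'mid', 'team', 'team'],
    'kills':    [3, 5, 1, 2, 8, 3],
    'result':   [1, 1, 0, 0, 1, 0],
})

is_team_row = toy['position'] == 'team'
team_rows = toy[is_team_row]       # one row per team per match
player_rows = toy[~is_team_row]    # one row per player per match

# Match-level view: number of rows recorded for each gameid
rows_per_match = toy.groupby('gameid').size()

# Sanity check: player kills should add up to the team totals
player_totals = player_rows.groupby('side')['kills'].sum().sort_index()
team_totals = team_rows.set_index('side')['kills'].sort_index()
```

Keeping the team and player rows in separate frames avoids double counting when a statistic such as kills appears at both levels.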

Because the data come from a competition, the natural questions all relate to performance: Who is the best-performing player? Which teams performed best, and in which years? Which statistics best capture performance in each of these categories?

Investigative Question¶

For this project, I will focus on the outcome of the matches.

  • Which variables have the most influence on the outcome of each League of Legends match?
In [ ]:
# Initializing the match data into a full dataframe

FOLDER = 'OE Public Match Data'
# Build the paths with pathlib so they work on any OS, not just Windows
PATHS = [Path(FOLDER) / f'{year}.csv' for year in range(2014, 2025)]

def init_data():
    df = pd.concat([pd.read_csv(p) for p in PATHS], ignore_index=True)
    return df

csv_df = init_data()
C:\Users\ethan\AppData\Local\Temp\ipykernel_8696\821623590.py:7: DtypeWarning:

Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.

Dataset size, number of Matches¶

In [ ]:
game_id_value_counts = csv_df['gameid'].value_counts()
rows_per_game = game_id_value_counts.iloc[0]  # row count of the largest game
games = len(game_id_value_counts)

df_size = csv_df.shape

# Each match has 12 rows: 10 player rows + 2 team summary rows
df_size[0]/12
Out[ ]:
76840.0
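The division by 12 above relies on every match having exactly 12 rows. A small sketch of how that assumption could be checked rather than hard-coded is shown below on toy data; the variable names are illustrative. Note that both value_counts and nunique skip missing gameid values by default, so rows without a gameid would not be counted as a match.

```python
import pandas as pd

# Toy stand-in: three matches with 12 rows each
toy = pd.DataFrame({'gameid': ['a'] * 12 + ['b'] * 12 + ['c'] * 12})

rows_per_game = toy['gameid'].value_counts()
n_games = toy['gameid'].nunique()

# True only if every match has the same number of rows
uniform = rows_per_game.nunique() == 1

# When the row count is uniform, this matches n_games
n_games_from_rows = len(toy) / rows_per_game.iloc[0]
```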

2: Data Cleaning and Exploratory Data Analysis (TODO)¶

Data Cleaning¶

Helper Fns¶
In [ ]:
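def unq_val_dict(dataframe):
    """Summarize unique values per column.

    Returns (d1, d2): d1 maps each column with 30 or more unique values to its
    unique-value count, sorted descending; d2 maps each remaining column to the
    array of its unique values.
    """
    d1,d2 = {},{}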
    for column in dataframe.columns:
        unq_vals = dataframe[column].unique()
        num_unq_vals = len(unq_vals)
        d1[column] = num_unq_vals
        if num_unq_vals < 30:
            d2[column] = unq_vals

    for k,v in d2.items():
        d1.pop(k)
    d1 = dict(sorted(d1.items(), key=lambda x: x[1], reverse=True))
    return d1,d2


def nan_vals(dataframe):
    """Print a table of the percent of NaN and zero values in each column."""
    d1 = {}
    for column in dataframe.columns:
        total_values = len(dataframe[column])
        mv = dataframe[column].isnull().sum()
        mv_prop = np.round((mv / total_values) * 100,2)
        zv = (dataframe[column] == 0).sum()
        zv_prop = np.round((zv / total_values) * 100,2)
        d1[column] = (mv_prop,zv_prop)
    d1 = dict(sorted(d1.items(), key=lambda x: (x[1][0],x[1][1]), reverse=True))

    table = []
    for k, v in d1.items():
        table.append([k, v[0], v[1]])
    headers = ["Column", "NaN %", "Zero %"]
    print(tabulate(table, headers, tablefmt="pretty"))
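For comparison, the same NaN and zero percentages can be computed without an explicit loop. The sketch below is one possible vectorized variant, not a replacement the notebook depends on; the function name nan_zero_summary is hypothetical, and it returns a DataFrame instead of printing a table.

```python
import numpy as np
import pandas as pd

def nan_zero_summary(dataframe):
    """Percent of NaN and zero values per column, sorted like nan_vals."""
    summary = pd.DataFrame({
        'NaN %': dataframe.isna().mean() * 100,
        'Zero %': dataframe.eq(0).mean() * 100,
    }).round(2)
    return summary.sort_values(['NaN %', 'Zero %'], ascending=False)

# Toy data: one numeric column with a NaN, one with zeros, one text column
toy = pd.DataFrame({
    'a': [0, 1, np.nan, 3],
    'b': [0, 0, 2, 3],
    'c': ['x', 'y', None, 'z'],
})
summary = nan_zero_summary(toy)
print(summary)
```

Returning a DataFrame makes it easy to filter afterwards, for example to list only the columns with more than half of their values missing.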

List of Columns for Dataset¶

In [ ]:
# Most games last about 20 minutes, but they can run anywhere from 12 to 60.
# Outside of pro play, teams are allowed to surrender at 15 minutes; pros cannot surrender/forfeit.
TIME_COLS = [
    'xpat10', 'opp_xpat10', 'xpdiffat10', 'xpat15', 'opp_xpat15', 'xpdiffat15',
    'csat10', 'opp_csat10', 'csdiffat10', 'csat15', 'csdiffat15', 'opp_csat15',
    'killsat10', 'opp_killsat10', 'killsat15', 'opp_killsat15',
    'assistsat10', 'opp_assistsat10', 'assistsat15', 'opp_assistsat15',
    'deathsat10', 'opp_deathsat10', 'deathsat15', 'opp_deathsat15',
]


'''
INFORMATION REGARDING MONSTERS
- Void Grubs (temporarily in 2017)
- Chemtech Drake (removed in 2022)
Monster introduction year:
2016:
- Drakes (infernal, mountain, ocean, cloud)
- Herald
- Elders
2022: Hextech Drake
'''
#Monster kill columns
MONSTER_COLS = [
    'dragons (type unknown)',
    'chemtechs', 'hextechs',
    'clouds','mountains','oceans','infernals',

    'heralds', 'elders', 'dragons', 'barons',
    'opp_heralds', 'opp_elders', 'opp_dragons', 'opp_barons',

    'elementaldrakes',  'opp_elementaldrakes',
    'turretplates',     'opp_turretplates',
    'inhibitors',       'opp_inhibitors',
    'void_grubs',       'opp_void_grubs',
]

MATCH_COLS = [
    'datacompleteness',
    'gameid', 'url',
    'league',
    #time info
    'year', 'split', 'date', 'patch',
    #game data
    'gamelength',   # length of game in seconds
    'playoffs',     # 0/1 playoff game
    'result',       # win/loss
    'game'          # num game played
]

TEAM_COLS = [
    'teamname', 'teamid',
    #team picks and bans
    'ban1', 'ban2', 'ban3', 'ban4', 'ban5',
    'pick1', 'pick2', 'pick3', 'pick4', 'pick5',

    # Team K/D
    'teamkills', 'team kpm',
    'teamdeaths', 'ckpm', # combined kills per minute (both teams)

    'gspd', # average gold spent percentage difference
    'gpr', # gold percentage rating (relative to 50%)
]

PLAYER_COLS = [
    'participantid',
    'side', #RED VS BLUE!
    'position',
    'playername',
    'playerid',

    'champion', #ME

]

GAME_STATS_COLS = [
    #kda info
    'kills', 'deaths', 'assists',
    #multikills
    'doublekills', 'triplekills', 'quadrakills', 'pentakills',
    #First blood = bonus 100 gold
    'firstblood', 'firstbloodkill', 'firstbloodassist', 'firstbloodvictim',

    # Side note on inhibitors: they respawn, and super minions stop spawning once they do,
    # so killing an inhibitor helps siege the enemy but also gives them gold.
    # If all inhibitors are down at once, each wave sends 2 super minions.
    'damagetochampions',
    'dpm',
    'damageshare',
    'damagetakenperminute',
    'damagemitigatedperminute', # damage blocked with shielding
    # The minimap is dark by default unless a tower, allied champion, or minion is nearby.
    # Free wards provide temporary vision where they are placed.
    'wardsplaced',
    # Everyone gets a free ward every 2 minutes, or can trade free wards for a ward sweeper
    # with a 90-second cooldown. Free wards turn invisible, so they must be swept.
    'wpm',
    'wardskilled',
    'wcpm',
    # Control wards cost 75 gold and last until the enemy kills them; they are not invisible.
    # An enemy free ward next to a control ward is disabled.
    'controlwardsbought',
    'visionscore', # earned by placing and destroying wards
    'vspm',
    #Gold stats
    'totalgold', 'earnedgold', 'goldspent',
    'earned gpm', 'earnedgoldshare',

    #CS stats: kills of neutral monsters/minions
    #CS gives experience
    'total cs', 'cspm', 'minionkills', 'monsterkills',
    'monsterkillsownjungle', 'monsterkillsenemyjungle',
]
In [ ]:
OPP_COLS = [s for s in list(csv_df.columns) if s.startswith('opp_')]
GARBAGE_COLS = [
    'monsterkillsenemyjungle', 'monsterkillsownjungle', 'void_grubs'
]

TIME_COLS = [
    'xpat10', 'opp_xpat10', 'xpdiffat10', 'xpat15', 'opp_xpat15', 'xpdiffat15',
    'csat10', 'opp_csat10', 'csdiffat10', 'csat15', 'csdiffat15', 'opp_csat15',
    'killsat10', 'opp_killsat10', 'killsat15', 'opp_killsat15',
    'assistsat10', 'opp_assistsat10', 'assistsat15', 'opp_assistsat15',
    'deathsat10', 'opp_deathsat10', 'deathsat15', 'opp_deathsat15',
    'goldat10', 'opp_goldat10', 'golddiffat10', 'goldat15', 'opp_goldat15', 'golddiffat15',
]

FIRST_COLS = [
    'firstblood',
    'firstbloodkill',
    'firstbloodassist',
    'firstbloodvictim',
    'firstdragon',
    'firstherald',
    'firstbaron',
    'firsttower',
    'firstmidtower',
    'firsttothreetowers'
]

#IMPUTATIONS

#TIME
for col in TIME_COLS:
    csv_df[col] = csv_df[col].fillna(csv_df[col].mean())
#SELECTIONS
SEL_COLS = ['pick1', 'pick2', 'pick3', 'pick4', 'pick5'] + ['ban1', 'ban2', 'ban3', 'ban4', 'ban5']
csv_df[SEL_COLS] = csv_df[SEL_COLS].fillna('none')
#MONSTERS
for col in MONSTER_COLS:
    csv_df[col] = csv_df[col].fillna(0)
#FIRST
for col in FIRST_COLS:
    csv_df[col] = csv_df[col].fillna(0)

#drop opp cols
# csv_df = csv_df.drop(columns=OPP_COLS)



# a = csv_df[csv_df['monsterkillsownjungle'].isna()]
# a[['monsterkillsownjungle','monsterkills']]


csv_df['multikills'] = csv_df[['doublekills', 'triplekills', 'quadrakills', 'pentakills']].sum(axis=1)
objective_control = [
    'dragons (type unknown)', 'chemtechs', 'hextechs',
    'clouds', 'mountains', 'oceans', 'infernals',
    'heralds', 'elders', 'dragons', 'barons',
    'elementaldrakes', 'turretplates', 'inhibitors',
]
csv_df['neutral_kills'] = csv_df[objective_control].sum(axis=1)


nan_vals(csv_df)
C:\Users\ethan\AppData\Local\Temp\ipykernel_8696\3780977128.py:52: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

C:\Users\ethan\AppData\Local\Temp\ipykernel_8696\3780977128.py:59: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

+--------------------------+-------+--------+
|          Column          | NaN % | Zero % |
+--------------------------+-------+--------+
|           gpr            | 85.29 |  0.02  |
|           gspd           | 84.08 |  0.0   |
|          towers          | 83.33 |  0.97  |
|        opp_towers        | 83.33 |  0.97  |
| monsterkillsenemyjungle  | 31.6  |  30.7  |
|  monsterkillsownjungle   | 31.6  | 14.93  |
|           url            | 29.84 |  0.0   |
|          split           | 21.39 |  0.0   |
|         playerid         | 17.47 |  0.0   |
|       damageshare        | 16.78 |  0.0   |
|     earnedgoldshare      | 16.78 |  0.0   |
|        playername        | 16.67 |  0.0   |
|         champion         | 16.67 |  0.0   |
|         total cs         | 15.34 |  0.01  |
| damagemitigatedperminute | 12.1  | 10.76  |
|       triplekills        | 11.33 | 81.57  |
|       doublekills        | 11.33 | 61.58  |
|        pentakills        | 8.62  | 91.12  |
|       quadrakills        | 8.62  | 90.08  |
|       visionscore        | 8.47  | 10.86  |
|           vspm           | 8.47  | 10.86  |
|        goldspent         | 4.46  |  0.0   |
|       minionkills        | 2.09  |  0.01  |
|           cspm           | 1.96  |  0.01  |
|          teamid          | 1.06  |  0.0   |
|          patch           | 0.93  |  0.0   |
|        earnedgold        | 0.14  |  0.02  |
|        earned gpm        | 0.14  |  0.02  |
|        totalgold         | 0.14  |  0.0   |
|       monsterkills       | 0.13  | 19.13  |
|    controlwardsbought    | 0.13  |  1.89  |
|           wcpm           | 0.13  |  1.14  |
|       wardskilled        | 0.13  |  1.13  |
|       wardsplaced        | 0.13  |  1.12  |
|           wpm            | 0.13  |  1.12  |
|    damagetochampions     | 0.13  |  0.0   |
|           dpm            | 0.13  |  0.0   |
|   damagetakenperminute   | 0.13  |  0.0   |
|           game           | 0.07  |  0.0   |
|         teamname         | 0.02  |  0.0   |
|          gameid          | 0.01  |  0.0   |
|        void_grubs        |  0.0  | 99.47  |
|      opp_void_grubs      |  0.0  | 99.47  |
|        chemtechs         |  0.0  | 99.23  |
|          elders          |  0.0  |  99.1  |
|        opp_elders        |  0.0  |  99.1  |
|         hextechs         |  0.0  |  98.5  |
|  dragons (type unknown)  |  0.0  | 97.75  |
|     elementaldrakes      |  0.0  | 95.51  |
|   opp_elementaldrakes    |  0.0  | 95.51  |
|          clouds          |  0.0  |  95.2  |
|          oceans          |  0.0  | 95.18  |
|        mountains         |  0.0  | 95.14  |
|        infernals         |  0.0  | 95.12  |
|       turretplates       |  0.0  | 94.96  |
|     opp_turretplates     |  0.0  | 94.96  |
|       firstherald        |  0.0  | 93.34  |
|        firstbaron        |  0.0  | 93.06  |
|     firstbloodvictim     |  0.0  | 92.65  |
|       firstdragon        |  0.0  | 92.65  |
|      firstmidtower       |  0.0  | 92.65  |
|    firsttothreetowers    |  0.0  | 92.65  |
|        firsttower        |  0.0  | 92.36  |
|         heralds          |  0.0  |  92.1  |
|       opp_heralds        |  0.0  |  92.1  |
|      firstbloodkill      |  0.0  | 91.69  |
|     firstbloodassist     |  0.0  |  89.4  |
|          barons          |  0.0  | 87.92  |
|        opp_barons        |  0.0  | 87.92  |
|         dragons          |  0.0  | 86.78  |
|       opp_dragons        |  0.0  | 86.78  |
|         playoffs         |  0.0  | 84.46  |
|        inhibitors        |  0.0  |  81.2  |
|      opp_inhibitors      |  0.0  |  81.2  |
|        firstblood        |  0.0  | 74.51  |
|        multikills        |  0.0  | 72.88  |
|      neutral_kills       |  0.0  |  72.2  |
|        killsat10         |  0.0  |  55.6  |
|      opp_killsat10       |  0.0  |  55.6  |
|        deathsat10        |  0.0  | 53.71  |
|      opp_deathsat10      |  0.0  | 53.71  |
|          result          |  0.0  | 50.04  |
|       assistsat10        |  0.0  | 49.43  |
|     opp_assistsat10      |  0.0  | 49.43  |
|        killsat15         |  0.0  | 40.81  |
|      opp_killsat15       |  0.0  | 40.81  |
|        deathsat15        |  0.0  | 37.29  |
|      opp_deathsat15      |  0.0  | 37.29  |
|       assistsat15        |  0.0  | 30.03  |
|     opp_assistsat15      |  0.0  | 30.03  |
|          kills           |  0.0  |  17.1  |
|        csdiffat10        |  0.0  | 15.26  |
|        csdiffat15        |  0.0  | 14.01  |
|        xpdiffat10        |  0.0  |  11.9  |
|       golddiffat10       |  0.0  | 11.86  |
|       golddiffat15       |  0.0  | 11.82  |
|        xpdiffat15        |  0.0  | 11.82  |
|          deaths          |  0.0  |  9.93  |
|         assists          |  0.0  |  3.94  |
|          csat10          |  0.0  |  0.48  |
|        opp_csat10        |  0.0  |  0.48  |
|        teamkills         |  0.0  |  0.44  |
|         team kpm         |  0.0  |  0.44  |
|        teamdeaths        |  0.0  |  0.43  |
|          csat15          |  0.0  |  0.06  |
|        opp_csat15        |  0.0  |  0.06  |
|     datacompleteness     |  0.0  |  0.0   |
|          league          |  0.0  |  0.0   |
|           year           |  0.0  |  0.0   |
|           date           |  0.0  |  0.0   |
|      participantid       |  0.0  |  0.0   |
|           side           |  0.0  |  0.0   |
|         position         |  0.0  |  0.0   |
|           ban1           |  0.0  |  0.0   |
|           ban2           |  0.0  |  0.0   |
|           ban3           |  0.0  |  0.0   |
|           ban4           |  0.0  |  0.0   |
|           ban5           |  0.0  |  0.0   |
|          pick1           |  0.0  |  0.0   |
|          pick2           |  0.0  |  0.0   |
|          pick3           |  0.0  |  0.0   |
|          pick4           |  0.0  |  0.0   |
|          pick5           |  0.0  |  0.0   |
|        gamelength        |  0.0  |  0.0   |
|           ckpm           |  0.0  |  0.0   |
|         goldat10         |  0.0  |  0.0   |
|          xpat10          |  0.0  |  0.0   |
|       opp_goldat10       |  0.0  |  0.0   |
|        opp_xpat10        |  0.0  |  0.0   |
|         goldat15         |  0.0  |  0.0   |
|          xpat15          |  0.0  |  0.0   |
|       opp_goldat15       |  0.0  |  0.0   |
|        opp_xpat15        |  0.0  |  0.0   |
+--------------------------+-------+--------+
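After mean imputation, the time columns such as xpat10 show 0.0% NaN in the table above, so the information about which rows were originally missing is gone. One option, sketched below on toy data, is to record a missingness flag before filling. The column name xpat10_missing is hypothetical, and whether such a flag is useful for this project is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one time column with two missing values
toy = pd.DataFrame({
    'xpat10': [3000.0, np.nan, 3400.0, np.nan],
    'kills':  [2, 0, 5, 1],
})

# Record which rows were missing *before* imputing (hypothetical column name)
toy['xpat10_missing'] = toy['xpat10'].isna()

# Same style of mean imputation as in the cleaning cell
toy['xpat10'] = toy['xpat10'].fillna(toy['xpat10'].mean())
```

With the flag in place, a later missingness analysis can still compare the rows that were missing against the rows that were present.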

VARIABLE ANALYSIS¶

UNIVARIATE¶
In [ ]:
time_cols_to_plot = [
    'xpat10', 'xpat15',
]

# Plot boxplots for the 10-minute and 15-minute marks
for col in time_cols_to_plot:
    fig = px.box(csv_df, y=col, title=f'Boxplot of {col}', labels={col: col})
    fig.show()
In [ ]:
#create a logarithmic histogram of neutral monster kills
fig = px.histogram(csv_df, x='neutral_kills', log_y=True,
                   title='Logarithmic Histogram of Neutral Kills',
                   labels={'neutral_kills': 'Neutral Kills'})
fig.show()

COMPARING CREEP SCORE WITH MONSTER KILLS¶

In [ ]:
#create a 50000 sample of the df
sample_df = csv_df.sample(50000)

fig = px.scatter(sample_df, x='monsterkills', y='total cs',
                 title='Scatter Plot of Monster Kills vs Total CS',
                 labels={'monsterkills': 'Monster Kills', 'total cs': 'Total CS'})

fig.show()

COMPARING EARLY GAME SCORES AND WIN/LOSS¶

In [ ]:
sdf = csv_df[['xpdiffat15','csdiffat15','result']].sample(5000)

fig = px.scatter(sdf, x='xpdiffat15', y='csdiffat15', color='result',
                 color_discrete_map={1: 'green', 0: 'red'},
                 title='Scatter Plot of XP Difference at 15 vs CS Difference at 15',
                 labels={'xpdiffat15': 'XP Difference at 15', 'csdiffat15': 'CS Difference at 15'})

# fig.show()
In [ ]:
sdf = csv_df[csv_df['gamelength']<20000][['kills','gamelength','result']].sample(5000)

fig = px.scatter(sdf, x='gamelength', y='kills', color='result',
                 color_discrete_map={1: 'green', 0: 'red'},
                 title='Scatter Plot of kills and time',)

# fig.show()

Step 3: Assessment of Missingness¶

In [ ]:
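def permutation_test(data, n=10):
    # Return -1 as a sentinel when either group is empty. This happens, for example,
    # when the missingness column has no NaNs left because it was already imputed.
    if len(data[0]) == 0 or len(data[1]) == 0:
        return -1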

    observed_stat, _ = ks_2samp(data[0], data[1])
    combined = np.concatenate(data)
    count = 0
    for _ in range(n):
        np.random.shuffle(combined)
        permuted_data = (combined[:len(data[0])], combined[len(data[0]):])
        permuted_stat, _ = ks_2samp(permuted_data[0], permuted_data[1])
        if permuted_stat >= observed_stat:
            count += 1
    return count / n

def missingness_dependency_test_all_cols(m_col):
    missingness_column = m_col #'xpat10'
    columns_to_test = [col for col in csv_df.columns if col != missingness_column]

    # The mask does not depend on the column being tested, so compute it once
    missing = csv_df[missingness_column].isna()
    present = ~missing

    d = {}
    for col in columns_to_test:

        #dont check non-numeric columns
        if pd.api.types.is_numeric_dtype(csv_df[col]):

            # debug output: size of the group where m_col is missing
            print(csv_df[col][missing].dropna().shape)

            data = (csv_df[col][missing].dropna(), csv_df[col][present].dropna())
            p_value = permutation_test(data)
            d[col] = p_value


    #sort d by vals
    d = dict(sorted(d.items(), key=lambda x: x[1], reverse=True))

    #use tabulate to print values
    table = []
    for k,v in d.items():
        table.append([k,v])
    headers = ["Column", "P-Value"]
    print(tabulate(table, headers, tablefmt="pretty"))

missingness_dependency_test_all_cols('xpat10')
(0,)
+--------------------------+---------+
|          Column          | P-Value |
+--------------------------+---------+
|           year           |   -1    |
|         playoffs         |   -1    |
|           game           |   -1    |
|          patch           |   -1    |
|      participantid       |   -1    |
|        gamelength        |   -1    |
|          result          |   -1    |
|          kills           |   -1    |
|          deaths          |   -1    |
|         assists          |   -1    |
|        teamkills         |   -1    |
|        teamdeaths        |   -1    |
|       doublekills        |   -1    |
|       triplekills        |   -1    |
|       quadrakills        |   -1    |
|        pentakills        |   -1    |
|        firstblood        |   -1    |
|      firstbloodkill      |   -1    |
|     firstbloodassist     |   -1    |
|     firstbloodvictim     |   -1    |
|         team kpm         |   -1    |
|           ckpm           |   -1    |
|       firstdragon        |   -1    |
|         dragons          |   -1    |
|       opp_dragons        |   -1    |
|     elementaldrakes      |   -1    |
|   opp_elementaldrakes    |   -1    |
|        infernals         |   -1    |
|        mountains         |   -1    |
|          clouds          |   -1    |
|          oceans          |   -1    |
|        chemtechs         |   -1    |
|         hextechs         |   -1    |
|  dragons (type unknown)  |   -1    |
|          elders          |   -1    |
|        opp_elders        |   -1    |
|       firstherald        |   -1    |
|         heralds          |   -1    |
|       opp_heralds        |   -1    |
|        void_grubs        |   -1    |
|      opp_void_grubs      |   -1    |
|        firstbaron        |   -1    |
|          barons          |   -1    |
|        opp_barons        |   -1    |
|        firsttower        |   -1    |
|          towers          |   -1    |
|        opp_towers        |   -1    |
|      firstmidtower       |   -1    |
|    firsttothreetowers    |   -1    |
|       turretplates       |   -1    |
|     opp_turretplates     |   -1    |
|        inhibitors        |   -1    |
|      opp_inhibitors      |   -1    |
|    damagetochampions     |   -1    |
|           dpm            |   -1    |
|       damageshare        |   -1    |
|   damagetakenperminute   |   -1    |
| damagemitigatedperminute |   -1    |
|       wardsplaced        |   -1    |
|           wpm            |   -1    |
|       wardskilled        |   -1    |
|           wcpm           |   -1    |
|    controlwardsbought    |   -1    |
|       visionscore        |   -1    |
|           vspm           |   -1    |
|        totalgold         |   -1    |
|        earnedgold        |   -1    |
|        earned gpm        |   -1    |
|     earnedgoldshare      |   -1    |
|        goldspent         |   -1    |
|           gspd           |   -1    |
|           gpr            |   -1    |
|         total cs         |   -1    |
|       minionkills        |   -1    |
|       monsterkills       |   -1    |
|  monsterkillsownjungle   |   -1    |
| monsterkillsenemyjungle  |   -1    |
|           cspm           |   -1    |
|         goldat10         |   -1    |
|          csat10          |   -1    |
|       opp_goldat10       |   -1    |
|        opp_xpat10        |   -1    |
|        opp_csat10        |   -1    |
|       golddiffat10       |   -1    |
|        xpdiffat10        |   -1    |
|        csdiffat10        |   -1    |
|        killsat10         |   -1    |
|       assistsat10        |   -1    |
|        deathsat10        |   -1    |
|      opp_killsat10       |   -1    |
|     opp_assistsat10      |   -1    |
|      opp_deathsat10      |   -1    |
|         goldat15         |   -1    |
|          xpat15          |   -1    |
|          csat15          |   -1    |
|       opp_goldat15       |   -1    |
|        opp_xpat15        |   -1    |
|        opp_csat15        |   -1    |
|       golddiffat15       |   -1    |
|        xpdiffat15        |   -1    |
|        csdiffat15        |   -1    |
|        killsat15         |   -1    |
|       assistsat15        |   -1    |
|        deathsat15        |   -1    |
|      opp_killsat15       |   -1    |
|     opp_assistsat15      |   -1    |
|      opp_deathsat15      |   -1    |
|        multikills        |   -1    |
|      neutral_kills       |   -1    |
+--------------------------+---------+
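Every p-value in the table above is the sentinel -1, because xpat10 was mean-imputed during cleaning and therefore has no missing rows left to compare. The sketch below shows what the same kind of test looks like when the missingness mask is taken before imputation. The data are synthetic, with missingness deliberately tied to game length, and the function name ks_permutation_pvalue and its seed parameter are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_permutation_pvalue(a, b, n=200, seed=0):
    """Permutation p-value for the two-sample K-S statistic."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = ks_2samp(a, b).statistic
    combined = np.concatenate([a, b])
    count = 0
    for _ in range(n):
        shuffled = rng.permutation(combined)
        stat = ks_2samp(shuffled[:len(a)], shuffled[len(a):]).statistic
        count += stat >= observed
    return count / n

# Synthetic data: xpat10 is missing only in the shortest games
rng = np.random.default_rng(1)
gamelength = rng.normal(1900, 300, size=400)
xpat10 = np.where(gamelength < 1600, np.nan, rng.normal(3000, 200, size=400))
toy = pd.DataFrame({'gamelength': gamelength, 'xpat10': xpat10})

# Take the mask BEFORE any imputation
missing = toy['xpat10'].isna()
p_value = ks_permutation_pvalue(
    toy.loc[missing, 'gamelength'],
    toy.loc[~missing, 'gamelength'],
)
print(f"rows missing xpat10: {missing.sum()}, p-value: {p_value}")
```

In this synthetic setup the missingness depends on gamelength by construction, so a small p-value is expected; on the real data the test would need to run on the raw columns before the imputation step.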

Step 4: Permutation Testing¶

In [ ]:
from scipy.stats import ks_2samp
import plotly.express as px


test_df = csv_df[['result', 'neutral_kills']]
killsW = test_df[test_df['result'] == 1]['neutral_kills']
killsL = test_df[test_df['result'] == 0]['neutral_kills']

observed_ks_stat, _ = ks_2samp(killsW, killsL)


n = 100
all_kills = np.concatenate([killsW, killsL])
n_win = len(killsW)

p_ks_stats = np.zeros(n)

for i in range(n):
    np.random.shuffle(all_kills)
    p_killsW = all_kills[:n_win]
    p_killsL = all_kills[n_win:]
    p_ks_stat, _ = ks_2samp(p_killsW, p_killsL)
    p_ks_stats[i] = p_ks_stat

fig = px.histogram(pd.DataFrame(p_ks_stats), x=0, nbins=20, histnorm='probability', title='K-S Stat Distribution')
fig.show()

# pval
p_value = (np.array(p_ks_stats) >= observed_ks_stat).mean()

print(f"Observed K-S Stat: {observed_ks_stat}")
print(f"P-value (reject at 0.05): {p_value}")
Observed K-S Stat: 0.23597427758624723
P-value (reject at 0.05): 0.0
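A reported p-value of exactly 0.0 means that none of the 100 shuffled statistics reached the observed one, so the estimate can only say the true value is below roughly 1/100. A common convention, sketched below, adds one to both the numerator and the denominator so the estimate is never exactly zero. The function name permutation_p_value is hypothetical and the input values are made up for illustration.

```python
import numpy as np

def permutation_p_value(permuted_stats, observed_stat):
    """Permutation p-value with the +1 correction: (exceedances + 1) / (n + 1)."""
    permuted_stats = np.asarray(permuted_stats)
    exceed = np.sum(permuted_stats >= observed_stat)
    return (exceed + 1) / (len(permuted_stats) + 1)

# 100 made-up shuffled statistics, all far below the observed value
p_small = permutation_p_value(np.linspace(0.001, 0.01, 100), 0.236)

# Tiny example where two of three shuffled statistics reach the observed value
p_mixed = permutation_p_value([0.1, 0.3, 0.5], 0.3)
```

Increasing the number of permutations tightens the smallest p-value that can be reported, at the cost of a longer run time.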

Step 5: Framing a Prediction Problem¶

Based on our previous analyses, KDA performance, early-game advantages, and control of towers and neutral monsters all appear to be associated with whether a team wins. The prediction problem is therefore a binary classification task: given the metrics defined below, predict the outcome of the match (the result column).

Step 6: Baseline Model¶

In [ ]:
#early game advantage and objective control
early_advantage = [
    'xpdiffat10', 'xpdiffat15',
    'csdiffat10', 'csdiffat15',
    'firstherald', 'firstdragon', 'firstbaron',
    'firsttower', 'firstmidtower', 'firsttothreetowers',
]

objective_control = [
    'dragons (type unknown)', 'chemtechs', 'hextechs',
    'clouds', 'mountains', 'oceans', 'infernals',
    'heralds', 'elders', 'dragons', 'barons',
    'elementaldrakes', 'turretplates', 'inhibitors',
]

# csv_df['elem'] = csv_df[['clouds', 'mountains', 'oceans', 'infernals']].sum(axis=1)

kde = [
    'multikills',
    'kills', 'deaths', 'assists',
    'killsat10', 'killsat15',
    'assistsat10', 'assistsat15',
    'deathsat10', 'deathsat15'
]

result = ['result']

all_cols = early_advantage + objective_control + kde + result

# [p for p in csv_df['patch'].unique()]

# csv_df[all_cols]
# csv_df[objective_control].count(axis=1).describe()

for col in all_cols:
    if col not in csv_df.columns:
        print(col)

csv_df[early_advantage]
Out[ ]:
xpdiffat10 xpdiffat15 csdiffat10 csdiffat15 ... firstbaron firsttower firstmidtower firsttothreetowers
0 -87.0 -560.0 0.0 -1.0 ... 0.0 0.0 0.0 0.0
1 -1425.0 -703.0 -27.0 -36.0 ... 0.0 0.0 0.0 0.0
2 -87.0 -148.0 5.0 7.0 ... 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
922077 108.0 -77.0 2.0 7.0 ... 0.0 0.0 0.0 0.0
922078 -110.0 2322.0 22.0 45.0 ... 1.0 1.0 1.0 1.0
922079 110.0 -2322.0 -22.0 -45.0 ... 0.0 0.0 0.0 0.0

922080 rows × 10 columns

In [ ]:
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression


'''
ITERATE THROUGH DEGREES:
    MAKE PIPELINE WITH POLYNOMIAL FEATURES
    CROSSVALSCORE
'''

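# 'neutral_kills' is selected by the objective_control pipeline below, so include it here
test_df = csv_df[all_cols + ['neutral_kills']]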

x = test_df.drop(columns=['result'])
y = test_df['result']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=1)


select = FunctionTransformer(lambda x: x)
pipes = {
    'objective_control': make_pipeline(
        make_column_transformer( (select, ['neutral_kills']) ),
        LogisticRegression(),
    ),
    'early_game_advantage': make_pipeline(
        make_column_transformer( (select, early_advantage) ),
        LogisticRegression(),
    ),
    'kde': make_pipeline(
        make_column_transformer( (select, kde) ),
        LogisticRegression(),
    ),
}

pipe_df = pd.DataFrame()

# for pipe in pipes:
#     scores = cross_val_score(pipes[pipe], x_train, y_train, cv=3, scoring='accuracy')
#     pipe_df[pipe] = scores

# pipe_df.index = [f'Fold {i}' for i in range(1, 4)]
# pipe_df.index.name = 'Validation Fold'
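The comparison that the commented-out loop is aiming for can be sketched on synthetic data as follows. This is an illustration under assumptions, not the notebook's result: the data are randomly generated with column names borrowed from the notebook, accuracy is used as the score because result is a binary label, and StandardScaler is swapped in for the identity transformer so the solver converges on features with large raw scales.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: early-game diffs carry signal, neutral_kills is pure noise
rng = np.random.default_rng(0)
n = 600
result = rng.integers(0, 2, size=n)
sign = 2 * result - 1
toy = pd.DataFrame({
    'xpdiffat15': rng.normal(0, 1000, size=n) + 800 * sign,
    'csdiffat15': rng.normal(0, 20, size=n) + 10 * sign,
    'neutral_kills': rng.poisson(3, size=n),
    'result': result,
})
x, y = toy.drop(columns=['result']), toy['result']

feature_sets = {
    'early_game_advantage': ['xpdiffat15', 'csdiffat15'],
    'objective_control': ['neutral_kills'],
}

# One pipeline per feature set, scored with 5-fold cross-validated accuracy
scores = {}
for name, cols in feature_sets.items():
    pipe = make_pipeline(
        make_column_transformer((StandardScaler(), cols)),
        LogisticRegression(),
    )
    scores[name] = cross_val_score(pipe, x, y, cv=5, scoring='accuracy')

score_df = pd.DataFrame(scores, index=[f'Fold {i}' for i in range(1, 6)])
score_df.index.name = 'Validation Fold'
print(score_df)
```

The fold labels are generated from the same cv value used in cross_val_score, which keeps the index length consistent with the number of scores.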
In [ ]:
 
Out[ ]:
neutral_kills monsterkills
0 0.0 0.0
1 1.0 91.0
2 0.0 29.0
3 1.0 32.0
4 0.0 0.0
... ... ...
922075 0.0 0.0
922076 0.0 4.0
922077 0.0 0.0
922078 21.0 176.0
922079 10.0 112.0

922080 rows × 2 columns

Step 7: Final Model¶

Step 8: Fairness Analysis¶